fix(concurrency): batch-2 hazards C1, H1-H7, M1-M9 (17 fixes) by hongwei1 · Pull Request #2847 · OpenBankProject/OBP-API

hongwei1 · 2026-06-26T08:36:10Z

Summary

Fix 17 concurrency hazards identified in the batch-2 audit across consent
status transitions, Redis rate-limiting, mutable singletons, and business
status state machines.

Hazards addressed

Group	Hazards	Fix
C1	Bulk-payment batch-reference	Propagate claim failure before fan-out
H1, H2, H3, M5	Consent & UAC status races	Atomic conditional UPDATE (Doobie CAS)
H5, H6, H7, M8, M9	Mutable singletons & lazy init	Replace with TrieMap / @volatile
H4, M6, M7	Redis non-atomic ops	SETNX+EXPIRE → atomic SET NX EX
M1, M2, M3, M4	Business status machines & TR lock	Doobie CAS conditional UPDATE

Tests

Five ConcurrencyRace*Test suites added in code.concurrency; all pass
locally and in CI (8-shard build green).

Also included

Remove run_all_tests.sh (superseded by run_tests_parallel.sh which
mirrors the CI shard structure and injects required env vars)

…ards Add five ScalaTest suites under code.concurrency (tagged ConcurrencyRace) covering hazards found in the second codebase scan: - ConcurrentBulkPaymentRaceTest C1 - ConcurrentConsentStatusRaceTest H1, H2, H3, M5 - ConcurrentRateLimiterRaceTest H4, M6, M7 - ConcurrentMutableSingletonRaceTest H5, H7, H6, M8, M9 - ConcurrentBusinessStatusRaceTest M1, M2, M3, M4 Behavioural suites reproduce the race; structural suites (H6/M8/M9) assert the hardening primitive (volatile field / concurrent map) is present. Redis-backed suites self-skip when Redis is unavailable. Fixes land per file in follow-ups.

…e (C1) createTransactionRequestBulk ran claimBatchReference inside a bare Future and dropped the result. Two concurrent submissions with the same batch_reference both passed the earlier isBatchReferenceUsed check, then both proceeded — the losing INSERT hit the UniqueIndex and returned a Failure that was silently discarded, so a duplicate payment was fanned out. Claim the batch_reference BEFORE creating the parent transaction request or fanning out any payment, and surface the Failure via unboxFullOrFail (409) so the losing request aborts before charging.

… M5) checkAnswer / revoke / skip-SCA-accept all did find-then-saveMe with an in-memory status check, so concurrent requests could both pass the guard and both write: two correct answers both accepted (H1, H2), or a skip-SCA / challenge accept overwriting an explicit revoke (H3, M5). Replace the unconditional saveMe with guarded conditional UPDATEs (UPDATE ... WHERE id = ? AND mstatus = <guard>) via new DoobieConsentStatusQueries / DoobieUserAuthContextUpdateQueries. The loser of a race gets 0 affected rows and a Failure instead of a second success. revoke uses WHERE mstatus <> 'REVOKED' so an explicit revocation always wins.

…7, M8, H6, M9) - DynamicConnector.singletonObjectMap, SecureLogging.customPatternCache and APIUtil.connectorToEndpoint were scala.collection.mutable.Map, whose put/getOrElseUpdate are not thread-safe and can lose writes or corrupt the map during a concurrent resize. Switch to scala.collection.concurrent.TrieMap (same Map API, lock-free). - ObpLookupSystem.obpLookupSystem and ObpActorSystem.{obpActorSystem, northSideAkkaConnectorActorSystem} were plain vars with unguarded null-check-then-assign init. Mark @volatile and use synchronized double-checked init so a write publishes safely and the system is created exactly once.

…4, M6, M7) Add two atomic Redis primitives: setNxEx (SET key value EX ttl NX, the Jedis 2.9.0 five-arg overload) and incrementWithTtl (a Lua INCR + create-TTL). - H4: RateLimitingUtil.incrementCounter replaced its TTL-read-then-SET/INCR sequence with the atomic incrementWithTtl, so concurrent callers cannot lose increments or race the expiry. - M6: IdempotencyMiddleware.writeResponseKey used setex (unconditional overwrite); now setNxEx, so the first cached response for an idempotency key is immutable and a replay returns the correct body. - M7: IdempotencyMiddleware.tryAcquireLock used setnx + a separate expire (a crash between left a TTL-less key blocking all retries); now a single atomic setNxEx sets value and TTL together.

…s update (M1, M2, M3, M4) - M2 AccountAccessRequest.updateStatus had no terminal-state guard; M3 MappedAccountApplication and M4 MappedChallengeProvider.validateChallenge checked status/success in memory then saveMe. Concurrent requests could both write. Add conditional UPDATEs via new DoobieBusinessStatusQueries: action an access request only from INITIATED (M2), transition an application only from its loaded status (M3, optimistic CAS), and compare-and-set the challenge success flag from false (M4) so one challenge can never green-light a payment twice. NB: the challenge `Successful` field maps to column successful_c. - M1 Http4s510 updateTransactionRequestStatus now acquires lockTransactionRequest within the request transaction (mirroring Http4s400) so it cannot overwrite a status set by the racing challenge path.

Thread 0 was simulating the old unconditional .saveMe() write instead of calling DoobieConsentStatusQueries.conditionalStatusTransition, causing the test to demonstrate the pre-fix bug rather than verify the fix. Update to use the same conditional UPDATE path as the production Http4s310 code.

…l.sh)

…exity

…on status lock atomicallyLockTransactionRequest used .query[String].unique, which raises UnexpectedEnd when the transaction-request row does not exist. runUpdate runs it with unsafeRunSync, so the exception propagates; lockTransactionRequest's tryo converts it to a Failure box, the caller's .isDefined is then false, and booleanToFuture fires its default 400 before the handler reaches getTransactionRequestImpl — so an authorised caller hitting a non-existent request id got 400 'Failed to acquire transaction request lock' instead of 404. Switch to .query[String].option so a missing row returns Full(None) (query ran cleanly) and only a genuine DB/lock error yields a Failure. The Box .isDefined check in both callers (Http4s510 status update, Http4s400 challenge answer) now passes for a missing row, letting the downstream lookup return the correct 404; the 400 is reserved for real lock failures.

The skip-SCA auto-accept did MappedConsent.find(...).map(c => conditionalStatusTransition(...)). On an Empty box the map yielded Empty, which the for-comprehension's _ <- discarded silently, leaving the consent INITIATED with no error. The consent is created earlier in the same request so Empty is effectively unreachable, but a silent no-op is wrong. Unwrap the box with openOrThrowException so a missing consent surfaces as a clear internal error, preserving the prior fail-visibly contract while keeping the atomic guarded INITIATED -> ACCEPTED transition.

Four concurrency-guard rejection messages introduced by the recent conditional-UPDATE fixes returned plain English strings instead of OBP-XXXXX coded errors, breaking the API's error-code contract (clients match on the OBP- prefix). - AccountAccessRequest.updateStatus: reuse the existing AccountAccessRequestStatusNotInitiated (OBP-30278) — the same condition the caller already checks non-concurrently. - MappedConsent.checkAnswer: reuse the existing ConsentUpdateStatusError (OBP-35025). - MappedUserAuthContextUpdateProvider.checkAnswer: add UserAuthContextUpdateStatusError (OBP-30090, filling a gap in the 300XX block) — no existing constant covered this case. - Http4s510 updateTransactionRequestStatus lock gate: add TransactionRequestLockFailed (OBP-40050) — no existing constant covered this case.

… createConsent The conditional INITIATED -> ACCEPTED auto-accept was only applied to the v3.1.0 createConsent path; Http4s500 and Http4s510 still carried the verbatim pre-fix line .map(_.mStatus(ACCEPTED).saveMe()).head — both the concurrent-revoke resurrection race and the .head NoSuchElementException on a vanished consent remained live on the two current API versions. Add conditionalStatusTransitionByConsentId (keyed by consent id, no extra SELECT to obtain the row id) and use it at all three call sites. The M5 race test now exercises this exact shared helper. The v3.1.0 site also drops its redundant re-find of the consent it created moments earlier.

…n approval approveAccountAccessRequest granted view access BEFORE attempting the INITIATED -> APPROVED transition. When a concurrent rejection won the status race, the approver got a 400 but the already-committed grant stood: the target user kept view access on a request whose persisted status said REJECTED. Winning the conditional UPDATE first makes the CAS the single gate — the loser exits with no side effect.

…omically The Lua rewrite dropped the old implementation's recovery branch for a counter key stuck without an expiry (legacy writer, PERSIST, RDB restore dropping expiries): such a key incremented forever and permanently rate-limited its consumer. The script now recreates the window (count=1, fresh TTL) when it finds a key with TTL < 0. The script also returns {count, ttl} together, removing the separate TTL round trip per period per request — halving increment-phase Redis I/O on the hottest path — and eliminating the non-atomic TTL read that could observe a different window (or fabricate a default on failure). Connection boilerplate consolidated into a withJedis loan helper.

…n fails The batch_reference claim is deliberately committed before any payment runs, but a parent-TR creation failure (500) left the claim in place: every retry of a batch that never executed anything hit the 409 already-used guard, permanently burning the reference. Release the claim (scoped to the claiming transactionRequestId) before surfacing the 500 — at that point no payment has executed, so a retry is safe. The claim is intentionally never released after the fan-out, where payments may have partially executed.

…veMe The replaced Lift saveMe() path wrote updatedAt (CreatedUpdated trait) on every status change; the raw-SQL conditional UPDATEs did not, so rows exposed through JSON 'updated' fields (e.g. AccountAccessRequest v6.0.0) showed a status change frozen at the creation timestamp. All conditional UPDATEs now set updatedat = NOW() alongside the status.

…a no-op DoobieUtil.runUpdate silently falls back to a standalone auto-commit transactor when no request-scoped connection is available; a SELECT ... FOR UPDATE issued there acquires and immediately releases the row lock while still reporting success, so a background or scheduler caller would proceed believing it holds mutual exclusion. Expose hasRequestScopeConnection and log a loud warning from lockTransactionRequest when the lock cannot actually be held.

…cks and messages - MappedChallengeProvider: the 9-line compare-and-set block was copy-pasted into both userId match arms; extract markChallengeSuccessful and collapse the arms into one answer+user check (userId.forall), removing the duplicated wrong-answer branch too. - MappedConsent.revoke: use rows == 1 like every other id-keyed CAS site (>= 1 implied a multi-row possibility that an id-keyed UPDATE cannot have). - MappedAccountApplication: the CAS-loser branch reported 'already been accepted' (OBP-30314) even when the concurrent winner wrote REJECTED; use the generic UpdateAccountApplicationStatusError (OBP-30315) instead. - run_specific_tests.sh: drop stale references to the deleted run_all_tests.sh.

Three reliability holes let a failing run look green: - timeout exit 124 was unconditionally converted to success, so a shard whose JVM was hard-killed BEFORE tests completed still reported BUILD SUCCESS. Now 124 only counts as success when the shard log proves the tests finished (BUILD SUCCESS printed); otherwise it is a failure. - The verdict trusted only per-shard mvn exit codes. Add an authoritative surefire XML audit after the shards: any failure/error recorded in the reports fails the run and is listed by suite. Stale reports from earlier runs are deleted before the shards start so the audit (and the speed table) reflect only this run. - The ALL SHARDS PASSED / SOME SHARDS FAILED line was printed before the speed table, so piping through tail -N could truncate it. Print the verdict last, and also write PASS/FAIL to test-results/parallel/RESULT — a machine-readable signal that survives piping (a pipeline reports the LAST command's exit code, so './run_tests_parallel.sh | tail' returns tail's 0 even when the script exits 1).

Under json4s, JValue already extends scala.Product, so the implicit conversion JvalueToSuper (JValue => JvalueCaseClass) that the Lift-era ResourceDoc builder relied on never fires anymore — json.parse(...) now type-checks as a bare JValue without ever going through the wrapper. Whether exampleRequestBody ends up as a JvalueCaseClass or a raw JValue then depends on which other class initializes the formats/implicits first, making ConfirmationOfFundsServicePIISApiTest's .asInstanceOf[JvalueCaseClass] constructor-time cast fail intermittently (ClassCastException, suite aborts) depending on run ordering across shards. Wrap both example bodies in JvalueCaseClass explicitly at the ResourceDoc call site instead of relying on the dead implicit, and make the test's own body lookup pattern-match on both shapes so it no longer assumes one representation.

…CaseClass Same root cause as the earlier PIIS-only fix, extended to the other three v1.3 files: since the json4s migration, JValue already extends scala.Product, so the implicit conversion JvalueToSuper (JValue => JvalueCaseClass) that ResourceDoc construction relied on never fires. Every example body built via json.parse(...) across AIS, PIS, and SigningBaskets was therefore a raw JValue instead of the expected JvalueCaseClass wrapper — resource-docs serialization would reflect on it as a generic case class rather than taking the JvalueCaseClass special-case path, and any call site that asInstanceOf[JvalueCaseClass]'d the example body (as ConfirmationOfFundsServicePIISApiTest's constructor did) failed depending on class-init order. Wrap every example body explicitly at each ResourceDoc call site instead of relying on the dead implicit; drop the now-unused implicitConversions import.

# Conflicts: # obp-api/src/main/scala/code/api/berlin/group/v1_3/Http4sBGv13PIIS.scala

The local parallel runner could pass while CI failed (or mask what CI tests): - Run -pl obp-commons,obp-api like CI does. The shards previously ran obp-api only, so obp-commons' five util suites never executed locally; the shard-4 catch-all even added com.openbankproject.commons.* to its filter, where it matched nothing and -DfailIfNoTests=false turned that into a silent pass. - Inject OBP_DYNAMIC_CODE_SANDBOX_PERMISSIONS, mirroring the dynamic_code_sandbox_permissions line CI appends to test.default.props. Its absence was the actual cause of the three perennial local DynamicResourceDocTest failures (sandbox denied reflection/getenv in the native-execution scenarios) — not a machine quirk. - Harden the surefire audit: scan both modules' reports, count unparseable XMLs as failures instead of skipping them, report skipped/canceled tests, and fail when fewer than 2000 tests ran in total (a broken wildcardSuites filter otherwise runs nothing and reports a hollow green thanks to -DfailIfNoTests=false). First fully green local run after this: 2917 tests, 0 failures, 0 errors.

conditionalAccountAccessRequestStatus/conditionalAccountApplicationStatus (rowId -> accountAccessRequestId / accountApplicationId) and DoobieUserAuthContextUpdateQueries .conditionalStatusTransition (rowId -> userAuthContextUpdateId) took a generic rowId parameter that gave no hint which entity's id it expected. All call sites pass arguments positionally, so no callers change.

consentRowId sat next to the business identifier consentId (String, mConsentId column) in the same call sites, inviting confusion between the two. Rename to consentPrimaryKey, matching the existing PrimaryKey naming convention elsewhere in the codebase (authUserPrimaryKey, saveMetricsArchive's primaryKey param). Named-argument call sites in ConsentScheduler and ConcurrentConsentRaceTest updated to match.

…micDataIds The method returns DynamicDataAccess.DynamicDataId values, and the sibling method in the same trait already names that concept dynamicDataId (getAccessForRow). getReadableRowIds was the only outlier still using RowId terminology.

…el.sh Kept for anyone still typing ./run_all_tests.sh out of habit; it does no work of its own and just execs run_tests_parallel.sh with the same arguments.

sonarqubecloud · 2026-07-02T13:11:58Z

Quality Gate passed

Issues
10 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
3.9% Duplication on New Code

See analysis details on SonarQube Cloud

hongwei1 added 30 commits June 26, 2026 09:10

docs(concurrency): document batch-2 hazards (C1, H1-H7, M1-M9) and fixes

609a17b

chore: remove sequential test runner (superseded by run_tests_paralle…

4ee49bd

…l.sh)

refactor(context): extract processUacAnswer to reduce cognitive compl…

5a9c69d

…exity

merge: PIIS example-body json4s fix from worktree investigation

a2aada8

merge: extend PIIS example-body json4s fix to AIS/PIS/SigningBaskets

6c6935a

# Conflicts: # obp-api/src/main/scala/code/api/berlin/group/v1_3/Http4sBGv13PIIS.scala

chore: remove switch_to_java11.sh (unused, unreferenced)

19bc3cd

chore: restore run_all_tests.sh as a thin forward to run_tests_parall…

f3c59c9

…el.sh Kept for anyone still typing ./run_all_tests.sh out of habit; it does no work of its own and just execs run_tests_parallel.sh with the same arguments.

simonredfern merged commit 50faaaa into OpenBankProject:develop Jul 2, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(concurrency): batch-2 hazards C1, H1-H7, M1-M9 (17 fixes)#2847

fix(concurrency): batch-2 hazards C1, H1-H7, M1-M9 (17 fixes)#2847
simonredfern merged 31 commits into
OpenBankProject:developfrom
hongwei1:feature/concurrency-hazard-fixes-batch2

hongwei1 commented Jun 26, 2026

Uh oh!

sonarqubecloud Bot commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

hongwei1 commented Jun 26, 2026

Summary

Hazards addressed

Tests

Also included

Uh oh!

sonarqubecloud Bot commented Jul 2, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants